Regression without regrets: Workflow of initial data analyses

1 Overview

The focus of this document/website is to provide guidance on conducting initial data analysis in a reproducible manner in the context of intended regression analyses.

TODO: to add. create ToC dynamically:

2 IDA Framework

The IDA framework consists of six steps [Huebner et al 2018, Figure 1]. Here we assume that metadata (step I) exist in sufficient detail and that data cleaning (step II) has already been performed. Metadata summarize the background information about the data needed to properly conduct the IDA steps, and the data cleaning process identifies and corrects technical errors. Data screening (step III) examines data properties to inform decisions about the intended analysis. Initial data reporting (step IV) documents the insights from the previous steps and can be referred to when interpreting results from the regression modeling. A consequence of these analyses can be that the analysis plan needs to be refined or updated (step V). Finally, reporting of IDA results in research papers (step VI) is necessary to ensure transparency regarding key findings that influence the analysis or interpretation of results. Further details about the elements of IDA are discussed in [TG3 papers].

IDA framework

References

Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.

Huebner M, Vach W, le Cessie S, Schmidt C, Lusa L. Hidden Analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Med Res Meth 2020; 20:61.

3 Scope of the regression analyses for the examples

Regression models can be used for a wide range of purposes. For these examples, the assumptions on the regression analysis set-up are listed in Table 1, so that IDA tasks can be explained in a well-defined, practically relevant setting. Since a key principle is that IDA does not touch the research question, no associations between dependent (outcome) and independent (non-outcome) variables are considered.

Table 1: The scope of the regression analyses considered for IDA tasks

| Aspects of the research plan | Assumptions in this paper | Reason for the assumption |
|---|---|---|
| Dependent (outcome) variable | One dependent variable that can be continuous or binary; exclude time-to-event or longitudinal outcomes | Explain IDA tasks in a well-defined, practically relevant setting |
| Regression models | Models with linear predictors | Explain IDA tasks in a well-defined, practically relevant setting |
| Purpose of regression model | Adjust effect of one variable of interest for confounders; quantify the effects of explanatory variables on the outcome | Explain IDA tasks in a well-defined, practically relevant setting |
| Independent variables | “explanatory” or “confounder” depending on purpose of model; small to moderate number of mixed types; not high dimensional; no repeated measurements | To demonstrate IDA approaches for a mix of variables likely to be encountered in practice |
| Statistical analysis plan | Exists, defines the outcome variable, the type of regression model to be used, and a set of independent variables | IDA does not touch the research question, but may lead to an update or refinement of the analysis plan |

References:

Vach W. Regression Models as a Tool in Medical Research. Chapman/Hall CRC 2012

Harrell FE. Regression Modeling Strategies. Springer (2nd ed) 2015

Royston P and Sauerbrei W. Multivariable Model Building. Wiley (2008)

[…]

4 Data screening and possible actions

TODO: Check for copy and paste errors in table.

4.1 Univariate distributions

| Variable type | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous variables | General skewness | Help in interpreting results | Update SAP | Update intended presentation of results |
| Continuous variables | General skewness | Wide CI for coefficients | Use variable as log-transformed | Update intended presentation of results |
| Continuous variables | Outliers | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous variables | Spike at 0 | Narrow CI at 0 | Use appropriate representation of variable in model | Use 2 (or more) coefficients to distinguish 0 from non-0 continuous part |
| Categorical variables | Frequencies | Comparisons to default reference probably irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical variables | Rare categories | Wide CI for coefficients | Collapse/exclude | Fewer categories to present |
| Categorical variables | One very frequent category | Comparisons irrelevant? | Exclude variable | Variable omitted |
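Two of the SAP actions listed for continuous variables, winsorization and a separate representation of a spike at 0, can be sketched in base R. This is a minimal illustration on simulated data, not part of the original analysis: the helper `winsorize()`, the 1st/99th percentile limits, and the variable names are all assumptions made for the example.

```r
# Simulated right-skewed variable with a spike at 0 (hypothetical data)
set.seed(1)
x <- c(rep(0, 30), rlnorm(170, meanlog = 1, sdlog = 1))

# Winsorize: pull values beyond chosen percentiles back to those percentiles.
# The 1%/99% limits are an arbitrary choice for illustration.
winsorize <- function(v, probs = c(0.01, 0.99)) {
  lim <- quantile(v, probs = probs, na.rm = TRUE)
  pmin(pmax(v, lim[1]), lim[2])
}
x_w <- winsorize(x)

# Spike at 0: represent the variable by two terms, an indicator for x > 0
# and the log of the positive part (set to 0 where x is 0).
x_pos <- as.integer(x > 0)
x_log <- ifelse(x > 0, log(x), 0)

summary(x)
summary(x_w)
table(x_pos)
```

Entering both `x_pos` and `x_log` in a model gives the "2 (or more) coefficients" mentioned in the table: one for zero versus non-zero and one for the continuous part.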

4.2 Bivariate distributions

| Variable types | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous by continuous | Outliers (from the cloud) | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous by continuous | Correlations | Wide CI for coefficients | Winsorize or transform | Model involves winsorization |
| Continuous by categorical | Outliers (only visible in bivariate plot) | Wide CI for coefficients | | |
| Categorical by categorical | Frequent/rare combinations | Comparison to default reference irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical by categorical | Frequent/rare combinations | Interactions relevant? | Remove interaction from model | Fewer interactions to present |
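As a rough sketch of two of these checks, the base R code below computes a rank correlation between two continuous variables and flags rare combinations of two categorical variables. The data are simulated, and the variable names and the cell-count threshold of 5 are assumptions made only for this illustration.

```r
# Simulated data (hypothetical): two continuous and two categorical variables
set.seed(2)
n <- 200
d <- data.frame(
  x1 = rnorm(n),
  g1 = factor(sample(c("a", "b", "c"), n, replace = TRUE, prob = c(0.6, 0.3, 0.1))),
  g2 = factor(sample(c("u", "v"), n, replace = TRUE, prob = c(0.9, 0.1)))
)
d$x2 <- d$x1 + rnorm(n, sd = 0.5)  # constructed to correlate with x1

# Continuous by continuous: rank correlation is less sensitive to outliers
rho <- cor(d$x1, d$x2, method = "spearman", use = "pairwise.complete.obs")

# Categorical by categorical: cell counts reveal rare combinations
tab <- table(d$g1, d$g2)
rare <- which(tab < 5, arr.ind = TRUE)  # threshold of 5 is an arbitrary choice

rho
tab
rare
```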

4.3 Missing values

| Level | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Per variable | Number and proportion | Wide CI for coefficients | Remove variable if many missing values | |
| Pattern | Variables missing independently or together | | Omit variables together | Changes model |
| Pattern | Variables missing dependent on levels of other variables | Systematic missingness? Model still based on representative? | IPW needed? | Weighted analysis |
| Complete cases | Number and proportion | Few cases left for main CCO analysis | Multiple imputation (or other way of dealing with missing values)? | Result from MI analysis? Or applicability restricted to a subpopulation? |
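The per-variable and pattern checks can be sketched in base R as follows. The data frame and its variable names are hypothetical, and missing values are planted deliberately so that the two variables are partly missing together.

```r
# Hypothetical data with planted missing values
set.seed(3)
n <- 100
d <- data.frame(a = rnorm(n), b = rnorm(n),
                g = sample(c("x", "y"), n, replace = TRUE))
d$a[1:10] <- NA
d$b[6:15] <- NA   # overlaps with a in rows 6 to 10

# Per variable: number and proportion of missing values
n_miss <- colSums(is.na(d))
p_miss <- colMeans(is.na(d))

# Pattern: are a and b missing independently or together?
co_miss <- table(a_missing = is.na(d$a), b_missing = is.na(d$b))

# Pattern: does missingness in a depend on the levels of g?
miss_by_g <- tapply(is.na(d$a), d$g, mean)

n_miss
co_miss
miss_by_g
```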

References

Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.

Harrell

[…]

CRASH-2

Since a key principle of IDA is not to touch the research questions, before IDA commences the research aim and statistical analysis plan need to be in place. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan, which is described in more detail in the section [SAPcrash2.Rmd].

The hypothetical research aim for IDA is to develop a multivariable model for early death (death within 28 days from injury) using nine independent variables of mixed type (continuous, categorical, semicontinuous) with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome.

A prediction model was developed and validated based on this data set in “Predicting early death in patients with traumatic bleeding”, Perel et al, BMJ 2012, [supplement available at]. The assumed research aim is in line with that prediction model.

4.4 Introduction to CRASH-2

Description: Clinical Randomisation of an Antifibrinolytic in Significant Haemorrhage (CRASH-2) was a large randomised placebo-controlled trial among trauma patients with, or at risk of, significant haemorrhage, of the effects of antifibrinolytic treatment on death and transfusion requirement. The study is described at the original trial website. A public version of the data set is available from a repository of public data sets hosted by Vanderbilt University’s Department of Biostatistics (Prof. Frank Harrell Jr.).

The data set includes 20,207 patients and 44 variables.

Note: In contrast to the analysis described in Perel et al, variables describing the economic region and the treatment allocation are missing in the public version of the data set, and while the data set contains 20,207 patients, the research paper mentions 20,127 patients having been included in the study.

4.5 Crash2 dataset contents

4.5.1 Source dataset

Display the source dataset contents. The dataset is in the data-raw folder of the project directory.

TODO: Move the contents of the original data set to an appendix? IS it relevant for us? No, it looks good here-MH


Data frame: crash2

20207 observations and 44 variables, maximum # NAs: 17121

| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| entryid | Unique Numbers for Entry Forms | | | integer | integer | 0 |
| source | Method of Transmission of Entry Form to CC | | 5 | | integer | 0 |
| trandomised | Date of Randomization | | | Date | double | 0 |
| outcomeid | Unique Number From Outcome Database | | | integer | integer | 80 |
| sex | | | 2 | | integer | 1 |
| age | | | | | integer | 4 |
| injurytime | Hours Since Injury | | | numeric | double | 11 |
| injurytype | | | 3 | | integer | 0 |
| sbp | Systolic Blood Pressure | mmHg | | integer | integer | 320 |
| rr | Respiratory Rate | /min | | integer | integer | 191 |
| cc | Central Capillary Refille Time | s | | integer | integer | 611 |
| hr | Heart Rate | /min | | integer | integer | 137 |
| gcseye | Glasgow Coma Score Eye Opening | | | integer | integer | 732 |
| gcsmotor | Glasgow Coma Score Motor Response | | | integer | integer | 732 |
| gcsverbal | Glasgow Coma Score Verbal Response | | | integer | integer | 735 |
| gcs | Glasgow Coma Score Total | | | integer | integer | 23 |
| ddeath | Date of Death | | | Date | double | 17121 |
| cause | Main Cause of Death | | 7 | | integer | 17118 |
| scauseother | Description of Other Cause of Death | | 227 | | integer | 0 |
| status | Status of Patient at Outcome if Alive | | 3 | | integer | 3169 |
| ddischarge | Date of discharge, transfer to other hospital or day 28 from randomization | | | Date | double | 3185 |
| condition | Condition of Patient at Outcome if Alive | | 5 | | integer | 3251 |
| ndaysicu | Number of Days Spent in ICU | | | numeric | double | 182 |
| bheadinj | Significant Head Injury | | | integer | integer | 80 |
| bneuro | Neurosurgery Done | | | integer | integer | 80 |
| bchest | Chest Surgery Done | | | integer | integer | 80 |
| babdomen | Abdominal Surgery Done | | | integer | integer | 80 |
| bpelvis | Pelvis Surgery Done | | | integer | integer | 80 |
| bpe | Pulmonary Embolism | | | integer | integer | 80 |
| bdvt | Deep Vein Thrombosis | | | integer | integer | 80 |
| bstroke | Stroke | | | integer | integer | 80 |
| bbleed | Surgery for Bleeding | | | integer | integer | 80 |
| bmi | Myocardial Infarction | | | integer | integer | 80 |
| bgi | Gastrointestinal Bleeding | | | integer | integer | 80 |
| bloading | Complete Loading Dose of Trial Drug Given | | | integer | integer | 80 |
| bmaint | Complete Maintenance Dose of Trial Drug Given | | | integer | integer | 80 |
| btransf | Blood Products Transfusion | | | integer | integer | 80 |
| ncell | Number of Units of Red Call Products Transfused | | | numeric | double | 9963 |
| nplasma | Number of Units of Fresh Frozen Plasma Transfused | | | integer | integer | 9964 |
| nplatelets | Number of Units of Platelets Transfused | | | integer | integer | 9964 |
| ncryo | Number of Units of Cryoprecipitate Transfused | | | integer | integer | 9964 |
| bvii | Recombinant Factor VIIa Given | | | integer | integer | 374 |
| boxid | Treatment Box Number | | | integer | integer | 0 |
| packnum | Treatment Pack Number | | | integer | integer | 0 |

| Variable | Levels |
|---|---|
| source | telephone |
| | telephone entered manually |
| | electronic CRF by email |
| | paper CRF enteredd in electronic CRF |
| | electronic CRF |
| sex | male |
| | female |
| injurytype | blunt |
| | penetrating |
| | blunt and penetrating |
| cause | bleeding |
| | head injury |
| | myocardial infarction |
| | stroke |
| | pulmonary embolism |
| | multi organ failure |
| | other |
| scauseother | |
| | Acute Hypoxia |
| | ACUTE LUNG INJURY |
| | Acute Pulmonary Oedema |
| | Acute Renal Failure |
| | ACUTE RESPIRATORY DISTRESS SYNDROME (ARDS) |
| | acute respiratory failure |
| | acute respiratory failure+sepsis |
| | air amboli (embolism) |
| | Air embolism caused by penetrating lung trauma |
| | ... |
| status | discharged |
| | still in hospital |
| | transferred to other hospital |
| condition | no symptoms |
| | minor symptoms |
| | some restriction in lifestyle but independent |
| | dependent, but not requiring constant attention |
| | fully dependent, requiring attention day and night |

4.5.2 Updated analysis dataset

Additional metadata are added to the original source data set, and the modified data set is then written back to the data folder. Metadata are added for the following variables:

  • age - add label “Age” and unit “years”.
  • injury time - add unit “hours”.
  • total Glasgow coma score - add unit “points”.

TODO:

  • Do we want to select patients at this point or leave this for the analysis phase?
  • Do we also want to do a selection of variables here to take in to the IDA phase? i.e. drop variables we do not check in IDA?

As a cross check we display the contents again to ensure the additional data is added, and then write back the changes to the data folder in the file “data/a_crash2.rds”.
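A minimal base R sketch of this step is given below. It stores labels and units as the attributes `label` and `units`, which is the convention Hmisc uses; the toy data frame, its values, and the temporary file path standing in for “data/a_crash2.rds” are assumptions made for the illustration.

```r
# Toy stand-in for the analysis data set (values are hypothetical)
a_crash2 <- data.frame(age = c(34L, 51L),
                       injurytime = c(1.5, 3),
                       gcs = c(15L, 9L))

# Attach label and unit metadata as attributes
attr(a_crash2$age, "label") <- "Age"
attr(a_crash2$age, "units") <- "years"
attr(a_crash2$injurytime, "units") <- "hours"
attr(a_crash2$gcs, "units") <- "points"

# Write and re-read; a temporary file stands in for "data/a_crash2.rds"
path <- file.path(tempdir(), "a_crash2.rds")
saveRDS(a_crash2, path)
check <- readRDS(path)

# Cross check that the metadata survived the round trip
attr(check$age, "label")
attr(check$age, "units")
```

Saving as `.rds` rather than `.csv` is what preserves the attributes, so the metadata remain available in later IDA steps.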

Input object size: 1221480 bytes; 12 variables 20207 observations
New object size: 1223272 bytes; 12 variables 20207 observations

Input object size: 1546808 bytes; 14 variables 20207 observations
New object size: 1385720 bytes; 14 variables 20207 observations


Data frame: a_crash2

20207 observations and 14 variables, maximum # NAs: 17121

| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| entryid | Unique Numbers for Entry Forms | | | integer | integer | 0 |
| trandomised | Date of Randomization | | | Date | double | 0 |
| ddeath | Date of Death | | | Date | double | 17121 |
| age | Age | years | | integer | integer | 4 |
| sex | Sex | | 2 | | integer | 1 |
| sbp | Systolic Blood Pressure | mmHg | | integer | integer | 320 |
| hr | Heart Rate | /min | | integer | integer | 137 |
| rr | Respiratory Rate | /min | | integer | integer | 191 |
| gcs | Glasgow Coma Score Total | points | | integer | integer | 23 |
| cc | Central Capillary Refille Time | s | | integer | integer | 611 |
| injurytime | Hours Since Injury | hours | | numeric | double | 11 |
| injurytype | Injury type | | 3 | | integer | 0 |
| time2death | | | | | integer | 17121 |
| earlydeath | Death within 28 days from injury | | | integer | integer | 0 |

| Variable | Levels |
|---|---|
| sex | male |
| | female |
| injurytype | blunt |
| | penetrating |
| | blunt and penetrating |

5 Statistical analysis plan

Since a key principle of IDA is not to touch the research questions, the research aim and statistical analysis plan need to be in place before IDA commences. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan.

Hypothetical research aim for IDA: Develop a multivariable model for early death (death within 28 days from injury) using nine independent variables of mixed type (continuous, categorical, semicontinuous) with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome.

The assumed analysis aim is in line with the prediction model presented by Perel et al, BMJ 2012, supplement available at.

5.1 Outcome

Early death, i.e. in-hospital death within 28 days from injury (binary variable)

5.2 Statistical methods

Logistic regression will be used to model early death by the following independent variables (measured at randomisation) deemed important to predict early death.

Demographic measurements:

  • Age (age, years)
  • Sex (sex, male or female)

Physiological measurements:

  • Systolic blood pressure (sbp, mmHg)
  • Heart rate (hr, 1/min)
  • Respiratory rate (rr, 1/min)
  • Glasgow coma score (gcs, points)
  • Central capillary refill time (cc, seconds)

Characteristics of injury measurements:

  • Time since injury (injurytime, hours)
  • Type of injury (injurytype, ‘blunt’, ‘penetrating’ or ‘blunt and penetrating’)

Restricted cubic splines with 3 degrees of freedom and knots set to default values will be used for continuous variables. As the final prediction model should be parsimonious enough to simplify its application, a backward elimination algorithm with a significance level of \(\alpha=0.05\) will be applied to remove non-significant effects. Finally, the nonlinear representation of each continuous variable will be tested against a linear representation at \(\alpha=0.05\). If the nonlinear effect adds no value, the model will be refitted with a linear effect for that variable.
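The planned modeling steps can be sketched on simulated data as follows. This is an illustration under stated assumptions, not the analysis code: it uses `splines::ns()` (natural cubic splines, which are restricted cubic splines) with `glm()`, its default knot placement may differ from the defaults intended in the plan, likelihood-ratio tests are used because the plan does not name a test, and the data and effect sizes are invented.

```r
library(splines)  # ships with R; ns() fits natural (restricted) cubic splines

# Simulated data (hypothetical): risk depends nonlinearly on age only
set.seed(4)
n <- 500
d <- data.frame(age = runif(n, 18, 90), sbp = rnorm(n, 100, 25))
p <- plogis(-2 + 0.0015 * (d$age - 50)^2)
d$death <- rbinom(n, 1, p)

# Spline representation with 3 degrees of freedom per continuous variable
fit_full <- glm(death ~ ns(age, df = 3) + ns(sbp, df = 3),
                family = binomial, data = d)

# One backward-elimination step: likelihood-ratio test for dropping each term
step1 <- drop1(fit_full, test = "Chisq")

# Nonlinear versus linear representation of age (nested models)
fit_lin <- glm(death ~ age + ns(sbp, df = 3), family = binomial, data = d)
lr_nonlin <- anova(fit_lin, fit_full, test = "Chisq")

step1
lr_nonlin
```

In a full backward elimination, the term with the largest p-value above 0.05 in `step1` would be removed and the step repeated until all remaining terms are significant.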

5.3 Remarks

  • Regarding type of injury, the original paper describes its treatment in the model as follows: ‘Type of injury had three categories – penetrating, blunt, or blunt and penetrating – but we analysed it as “penetrating” or “blunt and penetrating.”’ It is not clear from that description what happened to the ‘blunt’ group. (I assume they were collapsed with ‘blunt and penetrating’.) Note (MH): we are going to consider the three categories, and then check out recommendations for the final analysis.

  • The original paper describes the modeling approach as follows: ‘We used a backward step-wise approach. Firstly, we included all potential prognostic factors and interaction terms that users considered plausible. These interactions included all potential predictors with type of injury, time since injury, and age. We then removed, one at a time, terms for which we found no strong evidence of an association, judged according to the P values (<0.05) from the Wald test.’ This would mean they tested at least 24 interaction terms, each possibly using several degrees of freedom! In the final model, only an interaction of Glasgow coma score and type of injury was included.

5.4 Preparations

The outcome variable, early death (i.e., death within 28 days from injury) must be computed from the time span between date of death and date of randomization using the following logic:

  • transform ddeath and trandomised into an interpretable date format and then compute the difference
  • interpret missing (i.e. NAs) as ‘not died within study period, at least not within 28 days’
  • if patients died after 28 days, treat as alive

This can be derived using the following code:
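A minimal base R sketch of this logic, run on hypothetical rows rather than the CRASH-2 data, is shown here. The variable names `time2death` and `earlydeath` follow the analysis data set; counting a death on exactly day 28 as early is an assumption.

```r
# Hypothetical rows mimicking the relevant columns
d <- data.frame(
  trandomised = as.Date(c("2006-01-10", "2006-02-01", "2006-03-15", "2006-04-20")),
  ddeath      = as.Date(c("2006-01-12", NA, "2006-04-30", "2006-05-18"))
)

# Days from randomisation to death (NA if no death date is recorded)
d$time2death <- as.integer(d$ddeath - d$trandomised)

# Early death: 1 if death within 28 days, 0 if the date is missing or later.
# Treating exactly 28 days as "within" is an assumption.
d$earlydeath <- as.integer(!is.na(d$time2death) & d$time2death <= 28)

d
```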

| Characteristic | N = 20207¹ |
|---|---|
| Death within 28 days from injury | 3076 (15%) |

¹ Statistics presented: n (%)

The number of deaths computed in the data set coincides with the number reported in Perel et al, BMJ 2012.

5.5 Sources

Data obtained from https://hbiostat.org/data/

LINK to data set

5.5.1 Data dictionary

The data dictionary can be found LINK

5.6 References

CRASH-2 Collaborators. Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): a randomised, placebo-controlled trial. Lancet 2010;376:23-32

Perel P, Prieto-Merino D, Shakur H, Clayton T, Lecky F, Bouamra O, Russell R, Faulkner M, Steyerberg EW, Roberts I. Predicting early death in patients with traumatic bleeding: development and validation of prognostic model. BMJ 2012; 345(aug15 1): e5166.

5.7 Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gtsummary_1.2.6 Hmisc_4.4-0     Formula_1.2-3   survival_3.2-3 
##  [5] lattice_0.20-40 forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5    
##  [9] purrr_0.3.4     readr_1.3.1     tidyr_1.0.2     tibble_3.0.1   
## [13] ggplot2_3.3.0   tidyverse_1.3.0 here_0.1       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.1          sass_0.2.0          jsonlite_1.6.1     
##  [4] splines_3.6.1       modelr_0.1.6        assertthat_0.2.1   
##  [7] latticeExtra_0.6-29 cellranger_1.1.0    yaml_2.2.1         
## [10] pillar_1.4.4        backports_1.1.7     glue_1.4.1         
## [13] digest_0.6.25       RColorBrewer_1.1-2  checkmate_2.0.0    
## [16] rvest_0.3.5         colorspace_1.4-1    htmltools_0.4.0    
## [19] Matrix_1.2-18       pkgconfig_2.0.3     broom_0.5.5        
## [22] haven_2.2.0         bookdown_0.18       scales_1.1.1       
## [25] jpeg_0.1-8.1        htmlTable_1.13.3    generics_0.0.2     
## [28] ellipsis_0.3.0      withr_2.2.0         nnet_7.3-13        
## [31] cli_2.0.2           magrittr_1.5        crayon_1.3.4       
## [34] readxl_1.3.1        evaluate_0.14       fs_1.3.2           
## [37] fansi_0.4.1         nlme_3.1-145        xml2_1.2.5         
## [40] foreign_0.8-76      tools_3.6.1         data.table_1.12.8  
## [43] hms_0.5.3           lifecycle_0.2.0     munsell_0.5.0      
## [46] reprex_0.3.0        cluster_2.1.0       compiler_3.6.1     
## [49] rlang_0.4.6         grid_3.6.1          gt_0.2.0.5         
## [52] rstudioapi_0.11     htmlwidgets_1.5.1   base64enc_0.1-3    
## [55] rmarkdown_2.1       gtable_0.3.0        DBI_1.1.0          
## [58] R6_2.4.1            gridExtra_2.3       lubridate_1.7.4    
## [61] knitr_1.28          commonmark_1.7      rprojroot_1.3-2    
## [64] stringi_1.4.6       rmdformats_0.3.7    Rcpp_1.0.4.6       
## [67] vctrs_0.3.0         rpart_4.1-15        acepack_1.4.1      
## [70] png_0.1-7           dbplyr_1.4.2        tidyselect_1.1.0   
## [73] xfun_0.12

6 Univariate distributions

Univariate summary CRASH-2 dataset

6.1 Data set overview

Using the Hmisc describe function, an overview of the data set is provided, including histograms of continuous variables.

6.1.1 Demographic variables

TODO: Should we plot the marginal distribution of the outcome?

Demographic variables

2 Variables   20207 Observations

age: Age years
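| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20203 | 4 | 84 | 0.999 | 34.56 | 15.55 | 18 | 19 | 24 | 30 | 43 | 55 | 64 |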
lowest : 1 14 15 16 17 , highest: 92 94 95 96 99
sex: Sex
| n | missing | distinct |
|---|---|---|
| 20206 | 1 | 2 |
 Value        male female
 Frequency   16935   3271
 Proportion  0.838  0.162
 

6.1.2 Physiological measurements

Physiological measurements

5 Variables   20207 Observations

sbp: Systolic Blood Pressure mmHg
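| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19887 | 320 | 173 | 0.989 | 98.45 | 27.86 | 60 | 70 | 80 | 95 | 110 | 130 | 143 |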
lowest : 4 10 12 20 25 , highest: 225 230 234 240 250
hr: Heart Rate /min
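| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20070 | 137 | 173 | 0.996 | 104.5 | 23.38 | 70 | 80 | 90 | 105 | 120 | 130 | 140 |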
lowest : 3 4 5 6 10 , highest: 190 192 198 200 220
rr: Respiratory Rate /min
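| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20016 | 191 | 68 | 0.99 | 23.06 | 7.052 | 14 | 16 | 20 | 22 | 26 | 30 | 35 |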
lowest : 1 2 3 4 5 , highest: 90 91 94 95 96
gcs: Glasgow Coma Score Total points
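| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20184 | 23 | 13 | 0.863 | 12.47 | 3.594 | 4 | 6 | 11 | 15 | 15 | 15 | 15 |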
lowest : 3 4 5 6 7 , highest: 11 12 13 14 15
 Value          3     4     5     6     7     8     9    10    11    12    13    14
 Frequency    784   520   441   584   733   576   504   663   586   951  1356  2140
 Proportion 0.039 0.026 0.022 0.029 0.036 0.029 0.025 0.033 0.029 0.047 0.067 0.106
                 
 Value         15
 Frequency  10346
 Proportion 0.513
 

cc: Central Capillary Refille Time s
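| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19596 | 611 | 20 | 0.945 | 3.267 | 1.67 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |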
lowest : 1 2 3 4 5 , highest: 17 18 20 30 60
 Value          1     2     3     4     5     6     7     8     9    10    11    12
 Frequency   1510  5328  6020  3367  1805   802   268   271    45   139     3     7
 Proportion 0.077 0.272 0.307 0.172 0.092 0.041 0.014 0.014 0.002 0.007 0.000 0.000
                                                           
 Value         13    15    16    17    18    20    30    60
 Frequency      3    19     3     1     1     2     1     1
 Proportion 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000
 

6.1.3 Characteristics of injury

Characteristics of injury

2 Variables   20207 Observations

injurytime: Hours Since Injury hours
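| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20196 | 11 | 93 | 0.972 | 2.844 | 2.35 | 0.5 | 1.0 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |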
lowest : 0.10 0.15 0.20 0.25 0.30 , highest: 22.00 45.00 48.00 72.00 96.00
injurytype: Injury type
| n | missing | distinct |
|---|---|---|
| 20207 | 0 | 3 |
 Value                      blunt           penetrating blunt and penetrating
 Frequency                  11189                  6552                  2466
 Proportion                 0.554                 0.324                 0.122
 

6.2 Categorical plots

A closer examination of the categorical predictors.

6.2.1 Categorical ordinal plots

The Glasgow coma score, an ordinal categorical variable, is also displayed separately.

6.3 Continuous plots

A closer examination of continuous predictors.

There is evidence of digit preference. Explore further with targeted summaries.
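One simple targeted summary for digit preference is a tabulation of the last digit of the recorded values. The sketch below uses made-up blood pressure readings, not the CRASH-2 data, so the counts only illustrate the technique.

```r
# Hypothetical blood pressure readings, many rounded to the nearest 10
sbp <- c(90, 100, 110, 120, 95, 100, 80, 130, 112, 100, 90, 87, 120, 110, NA)

# Tabulate the last digit; without digit preference the ten digits
# would be expected to occur about equally often
last_digit <- sbp %% 10
digit_tab <- table(factor(last_digit, levels = 0:9))
prop_zero <- unname(digit_tab["0"] / sum(digit_tab))

digit_tab
round(prop_zero, 2)
```

A proportion of terminal zeros far above 0.1 suggests that values were rounded at measurement or data entry.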

More detailed univariate summaries for the variables of interest are also provided below.

6.3.1 Age

Distribution of subject age [years]

6.3.2 Blood pressure

Distribution of SBP

6.3.3 Respiratory rate

Distribution of respiratory rate

6.3.4 Heart rate

Distribution of heart rate

6.3.5 Central capillary refill time

Distribution of central capillary refill time

6.3.6 Hours since injury

Distribution of hours since injury

6.4 Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Hmisc_4.4-0     Formula_1.2-3   survival_3.2-3  lattice_0.20-40
##  [5] forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5     purrr_0.3.4    
##  [9] readr_1.3.1     tidyr_1.0.2     tibble_3.0.1    ggplot2_3.3.0  
## [13] tidyverse_1.3.0 here_0.1       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.1          jsonlite_1.6.1      splines_3.6.1      
##  [4] modelr_0.1.6        assertthat_0.2.1    highr_0.8          
##  [7] latticeExtra_0.6-29 cellranger_1.1.0    yaml_2.2.1         
## [10] pillar_1.4.4        backports_1.1.7     glue_1.4.1         
## [13] digest_0.6.25       RColorBrewer_1.1-2  checkmate_2.0.0    
## [16] rvest_0.3.5         colorspace_1.4-1    htmltools_0.4.0    
## [19] Matrix_1.2-18       pkgconfig_2.0.3     broom_0.5.5        
## [22] haven_2.2.0         bookdown_0.18       patchwork_1.0.0    
## [25] scales_1.1.1        jpeg_0.1-8.1        htmlTable_1.13.3   
## [28] farver_2.0.3        generics_0.0.2      ellipsis_0.3.0     
## [31] withr_2.2.0         nnet_7.3-13         cli_2.0.2          
## [34] magrittr_1.5        crayon_1.3.4        readxl_1.3.1       
## [37] evaluate_0.14       fs_1.3.2            fansi_0.4.1        
## [40] nlme_3.1-145        xml2_1.2.5          foreign_0.8-76     
## [43] tools_3.6.1         data.table_1.12.8   hms_0.5.3          
## [46] lifecycle_0.2.0     munsell_0.5.0       reprex_0.3.0       
## [49] cluster_2.1.0       compiler_3.6.1      rlang_0.4.6        
## [52] grid_3.6.1          rstudioapi_0.11     htmlwidgets_1.5.1  
## [55] base64enc_0.1-3     labeling_0.3        rmarkdown_2.1      
## [58] gtable_0.3.0        DBI_1.1.0           R6_2.4.1           
## [61] gridExtra_2.3       lubridate_1.7.4     knitr_1.28         
## [64] rprojroot_1.3-2     stringi_1.4.6       rmdformats_0.3.7   
## [67] Rcpp_1.0.4.6        vctrs_0.3.0         rpart_4.1-15       
## [70] acepack_1.4.1       png_0.1-7           dbplyr_1.4.2       
## [73] tidyselect_1.1.0    xfun_0.12

7 Bivariate distributions

This section gives a continuous-by-categorical example, using sex as the grouping variable.

7.1 Summary by sex

Patient characteristics
| | male (N=16935) | female (N=3271) |
|---|---|---|
| Age | | |
|    Median | 30.0 | 35.0 |
|    Mean | 33.7 | 38.8 |
|    SD | 13.6 | 16.8 |
|    Q1, Q3 | 23.0, 41.0 | 25.0, 50.0 |
|    Range | 1.0 - 99.0 | 15.0 - 96.0 |
|    N-Miss | 3 | 1 |
| Heart Rate | | |
|    Median | 105.0 | 106.0 |
|    Mean | 104.3 | 105.2 |
|    SD | 21.2 | 21.0 |
|    Q1, Q3 | 90.0, 120.0 | 92.0, 120.0 |
|    Range | 3.0 - 198.0 | 3.0 - 220.0 |
|    N-Miss | 95 | 42 |
| Respiratory Rate | | |
|    Median | 22.0 | 22.0 |
|    Mean | 23.1 | 23.0 |
|    SD | 6.8 | 6.6 |
|    Q1, Q3 | 20.0, 26.0 | 20.0, 26.0 |
|    Range | 1.0 - 96.0 | 3.0 - 87.0 |
|    N-Miss | 143 | 48 |
| Systolic Blood Pressure | | |
|    Median | 95.0 | 90.0 |
|    Mean | 98.8 | 96.7 |
|    SD | 25.5 | 25.7 |
|    Q1, Q3 | 80.0, 110.0 | 80.0, 110.0 |
|    Range | 4.0 - 240.0 | 20.0 - 250.0 |
|    N-Miss | 267 | 53 |

| Characteristic | male, N = 16935¹ | female, N = 3271¹ | (Missing), N = 1¹ |
|---|---|---|---|
| Age | 34 (13.6) | 39 (16.8) | 30 (NA) |
|    Unknown | 3 | 1 | 0 |
| Heart Rate | 104 (21) | 105 (21) | 108 (NA) |
|    Unknown | 95 | 42 | 0 |
| Respiratory Rate | 23 (7) | 23 (7) | 22 (NA) |
|    Unknown | 143 | 48 | 0 |
| Systolic Blood Pressure | 99 (26) | 97 (26) | 100 (NA) |
|    Unknown | 267 | 53 | 0 |
| Central Capillary Refille Time | 3 (2) | 3 (2) | 4 (NA) |
|    Unknown | 509 | 102 | 0 |
| Glasgow Coma Score Total | 12 (4) | 13 (3) | 14 (NA) |
|    Unknown | 19 | 4 | 0 |
| Hours Since Injury | 2.85 (2.39) | 2.84 (2.67) | 1.00 (NA) |
|    Unknown | 10 | 1 | 0 |
| Injury type | | | |
|    blunt | 8962 (53%) | 2227 (68%) | 0 (0%) |
|    penetrating | 5930 (35%) | 621 (19%) | 1 (100%) |
|    blunt and penetrating | 2043 (12%) | 423 (13%) | 0 (0%) |

¹ Statistics presented: mean (SD); n (%)
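Grouped summaries of this kind can be reproduced in outline with base R. The sketch below uses a handful of invented patients and computes the mean, SD, and number of missing ages by sex, keeping patients with unknown sex as their own group; the group label "(Missing)" and all values are assumptions made for the illustration.

```r
# Hypothetical patients; one has unknown sex, one has unknown age
d <- data.frame(
  sex = c("male", "male", "female", "female", "female", "male", NA),
  age = c(30, 40, 35, 45, NA, 23, 30)
)

# Keep unknown sex as an explicit group instead of dropping it silently
grp <- factor(ifelse(is.na(d$sex), "(Missing)", d$sex))

mean_age <- tapply(d$age, grp, mean, na.rm = TRUE)
sd_age   <- tapply(d$age, grp, sd, na.rm = TRUE)
n_miss   <- tapply(is.na(d$age), grp, sum)

cbind(mean = mean_age, sd = sd_age, n_miss = n_miss)
```

A group with a single patient has no defined standard deviation, which is why such a group shows NA for the SD.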

7.2 Continuous variables by sex

7.2.1 Distribution of age by sex

Distribution of age by sex

7.2.2 Distribution of systolic blood pressure by sex

Distribution of systolic blood pressure by sex

7.2.3 Distribution of heart rate by sex

Distribution of heart rate by sex

7.2.4 Distribution of respiratory rate by sex

Distribution of respiratory rate by sex

7.2.5 Distribution of central capillary refill time by sex

Distribution of central capillary refill time by sex

7.3 Age

7.3.3 Continuous3

## Warning: package 'patchwork' was built under R version 3.6.3
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.

7.4 Scatter plots with a third or fourth variable

Scatter plot of age and RR by sex and injury type.

## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Warning: Removed 195 rows containing missing values (geom_point).

Scatter plot of SBP and RR by sex and injury type.

## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Warning: Removed 457 rows containing missing values (geom_point).

7.5 Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.0.0    gtsummary_1.2.6    arsenal_3.4.0      Hmisc_4.4-0       
##  [5] Formula_1.2-3      survival_3.2-3     lattice_0.20-40    summarytools_0.9.6
##  [9] janitor_2.0.1      forcats_0.5.0      stringr_1.4.0      dplyr_0.8.5       
## [13] purrr_0.3.4        readr_1.3.1        tidyr_1.0.2        tibble_3.0.1      
## [17] ggplot2_3.3.0      tidyverse_1.3.0    here_0.1          
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-145        matrixStats_0.56.0  fs_1.3.2           
##  [4] lubridate_1.7.4     RColorBrewer_1.1-2  httr_1.4.1         
##  [7] rprojroot_1.3-2     tools_3.6.1         backports_1.1.7    
## [10] R6_2.4.1            rpart_4.1-15        lazyeval_0.2.2     
## [13] DBI_1.1.0           colorspace_1.4-1    nnet_7.3-13        
## [16] withr_2.2.0         tidyselect_1.1.0    gridExtra_2.3      
## [19] compiler_3.6.1      cli_2.0.2           rvest_0.3.5        
## [22] gt_0.2.0.5          htmlTable_1.13.3    xml2_1.2.5         
## [25] plotly_4.9.2.1      labeling_0.3        sass_0.2.0         
## [28] bookdown_0.18       scales_1.1.1        checkmate_2.0.0    
## [31] commonmark_1.7      digest_0.6.25       foreign_0.8-76     
## [34] rmarkdown_2.1       base64enc_0.1-3     jpeg_0.1-8.1       
## [37] pkgconfig_2.0.3     htmltools_0.4.0     highr_0.8          
## [40] dbplyr_1.4.2        htmlwidgets_1.5.1   rlang_0.4.6        
## [43] readxl_1.3.1        rstudioapi_0.11     pryr_0.1.4         
## [46] farver_2.0.3        generics_0.0.2      jsonlite_1.6.1     
## [49] crosstalk_1.1.0.1   acepack_1.4.1       magrittr_1.5       
## [52] rapportools_1.0     Matrix_1.2-18       Rcpp_1.0.4.6       
## [55] munsell_0.5.0       fansi_0.4.1         lifecycle_0.2.0    
## [58] stringi_1.4.6       yaml_2.2.1          snakecase_0.11.0   
## [61] plyr_1.8.6          grid_3.6.1          crayon_1.3.4       
## [64] haven_2.2.0         splines_3.6.1       pander_0.6.3       
## [67] hms_0.5.3           magick_2.3          knitr_1.28         
## [70] pillar_1.4.4        tcltk_3.6.1         codetools_0.2-16   
## [73] reprex_0.3.0        glue_1.4.1          evaluate_0.14      
## [76] latticeExtra_0.6-29 data.table_1.12.8   modelr_0.1.6       
## [79] png_0.1-7           vctrs_0.3.0         rmdformats_0.3.7   
## [82] cellranger_1.1.0    gtable_0.3.0        assertthat_0.2.1   
## [85] xfun_0.12           broom_0.5.5         viridisLite_0.3.0  
## [88] cluster_2.1.0       ellipsis_0.3.0

8 Missing data

TODO: organise

Identify the number of complete cases and the number of patients with missing data.

## Warning: Factor `sex` contains implicit NA, consider using
## `forcats::fct_explicit_na`
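
The counting step described above can be sketched in base R as follows. This is a minimal illustration, not the analysis code behind this report: the data frame `dat`, its variables, and the `"(Missing)"` label are hypothetical, and base `addNA()` is used here in place of the `forcats` helper named in the warning.

```r
# Hypothetical toy data; variable names and values are illustrative only.
dat <- data.frame(
  id  = 1:6,
  sex = factor(c("male", "female", NA, "female", "male", NA)),
  age = c(54, NA, 61, 47, NA, 70)
)

# Complete cases: rows with no missing value in any variable.
n_complete   <- sum(complete.cases(dat))
n_incomplete <- nrow(dat) - n_complete

# Number of missing values per variable.
n_missing_by_var <- colSums(is.na(dat))

# Turn the implicit NA in `sex` into an explicit factor level so that
# grouped summaries report missing values as their own category.
dat$sex_explicit <- addNA(dat$sex)
levels(dat$sex_explicit)[is.na(levels(dat$sex_explicit))] <- "(Missing)"

# Frequency table including the explicit missing category.
table(dat$sex_explicit)
```

Making the missing category explicit keeps patients with unknown `sex` visible in stratified summaries, which is one way to avoid the repeated warning shown above.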

8.1 Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] naniar_0.5.2    Hmisc_4.4-0     Formula_1.2-3   survival_3.2-3 
##  [5] lattice_0.20-40 forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5    
##  [9] purrr_0.3.4     readr_1.3.1     tidyr_1.0.2     tibble_3.0.1   
## [13] ggplot2_3.3.0   tidyverse_1.3.0 here_0.1       
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1       httr_1.4.1          viridisLite_0.3.0  
##  [4] jsonlite_1.6.1      splines_3.6.1       modelr_0.1.6       
##  [7] assertthat_0.2.1    latticeExtra_0.6-29 cellranger_1.1.0   
## [10] yaml_2.2.1          pillar_1.4.4        backports_1.1.7    
## [13] visdat_0.5.3        glue_1.4.1          digest_0.6.25      
## [16] RColorBrewer_1.1-2  checkmate_2.0.0     rvest_0.3.5        
## [19] colorspace_1.4-1    plyr_1.8.6          htmltools_0.4.0    
## [22] Matrix_1.2-18       pkgconfig_2.0.3     broom_0.5.5        
## [25] haven_2.2.0         bookdown_0.18       scales_1.1.1       
## [28] jpeg_0.1-8.1        htmlTable_1.13.3    farver_2.0.3       
## [31] generics_0.0.2      ellipsis_0.3.0      UpSetR_1.4.0       
## [34] withr_2.2.0         nnet_7.3-13         cli_2.0.2          
## [37] magrittr_1.5        crayon_1.3.4        readxl_1.3.1       
## [40] evaluate_0.14       fs_1.3.2            fansi_0.4.1        
## [43] nlme_3.1-145        xml2_1.2.5          foreign_0.8-76     
## [46] tools_3.6.1         data.table_1.12.8   hms_0.5.3          
## [49] lifecycle_0.2.0     munsell_0.5.0       reprex_0.3.0       
## [52] cluster_2.1.0       compiler_3.6.1      rlang_0.4.6        
## [55] grid_3.6.1          rstudioapi_0.11     htmlwidgets_1.5.1  
## [58] labeling_0.3        base64enc_0.1-3     rmarkdown_2.1      
## [61] gtable_0.3.0        DBI_1.1.0           R6_2.4.1           
## [64] gridExtra_2.3       lubridate_1.7.4     knitr_1.28         
## [67] rprojroot_1.3-2     stringi_1.4.6       rmdformats_0.3.7   
## [70] Rcpp_1.0.4.6        vctrs_0.3.0         rpart_4.1-15       
## [73] acepack_1.4.1       png_0.1-7           dbplyr_1.4.2       
## [76] tidyselect_1.1.0    xfun_0.12